Manipulating and Summarizing Data

class: center, top, title-slide

# Manipulating and Summarizing Data
### Adam Kuczynski

---

# Today's Theme:
## "Data Engineer Work"

Issues around preparing a dataset for the analyses you want to run:

- Subsetting data
--

- Performing operations across rows and columns
--

- Creating new variables
--

- Creating rich summaries of your data
--

- Merging multiple datasets together

---
class: inverse
# Tibbles
<div style="text-align: center;">
  <img src="images/tibble.svg", alt="Tibble Logo" style="max-width: 60%;">
</div>
---
# What is a `tibble`?

From the tibble [webpage](https://tibble.tidyverse.org):
> "A **tibble**, or `tbl_df` [the official `class` of a tibble] is a modern reimagining of the data.frame, keeping what time has proben to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects."
--

Tibbles are part of the [Tidyverse](https://tidyverse.org) and operate in much the same way as dataframes (most of the time you don't need to worry about whether your object is a tibble or a dataframe)
--

Although most functions from the Tidyverse set of packages will operate on both dataframes and tibbles, some (e.g., `group_by()`) will return a tibble back

---
# Making Tibbles

.small[
There are three primary ways to make a tibble:

**1.** Convert a dataframe into a tibble using `tibble::as_tibble()`<sup>1</sup>

```r
mtcars <- as_tibble(mtcars)
class(mtcars)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

**2.** Use `tibble::tibble()`

```r
uwclinpsych <- tibble(name = c("Corey", "Angela", "Bill", "Mary", "Jane", "Lori"),
                      grads = c(1, 0, 4, 3, 2, 3),
                      fullprof = c(F, F, T, T, T, T))

head(uwclinpsych, 4)
```

```
## # A tibble: 4 x 3
##   name   grads fullprof
##   <chr>  <dbl> <lgl>   
## 1 Corey      1 FALSE   
## 2 Angela     0 FALSE   
## 3 Bill       4 TRUE    
## 4 Mary       3 TRUE
```

.footnote[[1] The `as_tibble()` function has been exported into the `tidyr` and `dplyr` package, so loading any of those will give you access to this function.]
]

---
**3.** use `tibble::tribble()` to construct a tibble row-wise. Column names are denoted with a `~` in front and are not quoted. Values are comma separated and rows are separated by a newline.

.smallish[

```r
uwclinpsych <- tribble(
  ~name,     ~grads,  ~fullprof,
  "Corey",   1,       FALSE,
  "Angela",  0,       FALSE,
  "Bill",    4,       TRUE,
  "Mary",    3,       TRUE,
  "Jane",    2,       TRUE,
  "Lori",    3,       TRUE
)

print(uwclinpsych)
```

```
## # A tibble: 6 x 3
##   name   grads fullprof
##   <chr>  <dbl> <lgl>   
## 1 Corey      1 FALSE   
## 2 Angela     0 FALSE   
## 3 Bill       4 TRUE    
## 4 Mary       3 TRUE    
## 5 Jane       2 TRUE    
## 6 Lori       3 TRUE
```
]
---
## A Nice Feature of `tibble()`

One really nice feature of constructing tibbles from scratch is that you can build on columns dynamically:

.smallish[
.pull-left[
`tibble()`

```r
tibble(nums = 1:10,
       lets = letters[1:10],
*      both = paste0(nums, lets))
```

```
## # A tibble: 10 x 3
##     nums lets  both 
##    <int> <chr> <chr>
##  1     1 a     1a   
##  2     2 b     2b   
##  3     3 c     3c   
##  4     4 d     4d   
##  5     5 e     5e   
##  6     6 f     6f   
##  7     7 g     7g   
##  8     8 h     8h   
##  9     9 i     9i   
## 10    10 j     10j
```
]

.pull-right[
`data.frame()`

```r
data.frame(nums = 1:10,
           lets = letters[1:10],
*          both = paste0(nums, lets))
```

```
## Error in paste0(nums, lets): object 'nums' not found
```

```r
df <- data.frame(nums = 1:10,
                 lets = letters[1:10])
df$both = paste0(df$nums, df$lets)
print(df)
```

👈 same output as `tibble()` except the class is dataframe

]
]
---
class: inverse
# Manipulating Data
<div style="text-align: center;">
  <img src="images/dplyr.svg", alt="Tibble Logo" style="max-width: 70%;">
</div>
---
# Starwars Data

To demonstrate much of `dplyr`'s functionality, we will use the `starwars` data that is loaded with `dplyr` and originally from [SWAPI](https://swapi.dev) (a Star Wars API)

.smallish[

```r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
```
]
---
## A Brief Reminder About Pipes

The `dplyr` package uses **verbs** to name the functions within. As a result, they work very nicely with the pipe (`%>%`) syntax

```r
take_these_data %>%
    do_first_thing(with = this_value) %>%
    do_next_thing(using = that_value) %>%
```

The LHS is passed as the *first argument* to the function on the RHS
--

You can reference the LHS with a `.` to use it in other places in the RHS function

```r
take_these_data %>%
  do_first_thing(argument = "Value", with = .) %>%
  do_next_thing(using = that_value) %>%
```

---
# `group_by()`

`group_by()` is a special function that controls the behavior of other functions as they operate on the data
--

It returns a tibble with the following classes: `grouped_df`, `tbl_df`, `tbl`, and `data.frame`
--

Most functions called on grouped data operate *within each group* rather than on the entire dataset
--

Data are typically grouped by variables that are characters, factors, or integers, not continuous data

---
.small[
For example, `group_by()` characters' `eye_color`

```r
starwars_grouped <- starwars %>%
  group_by(eye_color)

class(starwars_grouped)
```

```
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
```

Notice that this dataset has *exactly* the same data as the ungrouped version, except it now controls the output of other function calls

.pull-left[
**Not grouped**

```r
dim(starwars)
```

```
## [1] 87 14
```
]

.pull-right[
**Grouped**

```r
dim(starwars_grouped)
```

```
## [1] 87 14
```
]

To remove a grouping structure use the `ungroup()` function (if left blank, *all* grouping is removed, otherwise just the specified groups are ungrouped):

```r
starwars_ungrouped <- starwars_grouped %>%
  ungroup()

class(starwars_ungrouped)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```
]
---
## `group_by()` example

.small[
**Mean mass by character gender:**

```
##  feminine masculine      none 
##  54.68889 106.14694  48.00000
```

```r
starwars %>%
  # Center mass by sample average
* mutate(mass_gmc = mass - mean(mass, na.rm = T)) %>%
* group_by(gender) %>%
  # Center mass by group average
* mutate(mass_pmc = mass - mean(mass, na.rm = T)) %>%
  select(name, gender, mass, mass_gmc, mass_pmc)
```

```
## # A tibble: 87 x 5
## # Groups:   gender [3]
##    name               gender     mass mass_gmc mass_pmc
##    <chr>              <chr>     <dbl>    <dbl>    <dbl>
##  1 Luke Skywalker     masculine    77    -20.3   -29.1 
##  2 C-3PO              masculine    75    -22.3   -31.1 
##  3 R2-D2              masculine    32    -65.3   -74.1 
##  4 Darth Vader        masculine   136     38.7    29.9 
##  5 Leia Organa        feminine     49    -48.3    -5.69
##  6 Owen Lars          masculine   120     22.7    13.9 
##  7 Beru Whitesun lars feminine     75    -22.3    20.3 
##  8 R5-D4              masculine    32    -65.3   -74.1 
##  9 Biggs Darklighter  masculine    84    -13.3   -22.1 
## 10 Obi-Wan Kenobi     masculine    77    -20.3   -29.1 
## # … with 77 more rows
```
]

---
# Grouping Metadata

.smallish[
Sometimes it can be helpful for you to programmatically refer to your grouping structure. `dplyr` offers four functions that return grouping metadata to help with this task:

Use these as stand-alone function calls:
- `group_data()` returns a dataframe with integer vectors specifying the rows that belong to each group
- `group_indices()` returns an integer vector of the same length as `nrow(grouped_data)` specifying which group index a row belongs to
- `group_vars()` returns a character vector of the colnames used for grouping
- `group_size()` returns an integer vector of length `n_groups()` specifying the number of rows within each group
- `n_group()` returns an integer vector of length 1 with the number of groups in the data

Use these inside other `dplyr` functions:
- `cur_data()` refers to the current (ungrouped) data for the current group
- `cur_group()` refers to the current group value(s)
- `cur_group_id()` refers to the current group unique numeric identifier
]

---
# `rowwise()`
.smallish[
`rowwise()` allows you to perform operations on data one row at a time (equivalent to `group_by()` each row or `for` looping down each row)

For example, to simulate normally distributed data with different parameters:

```r
df <- tibble(x = runif(6), y = runif(6), z = runif(6))
```

.pull-left[

```r
df %>%
  mutate(m = mean(c(x, y, z)))
```

```
## # A tibble: 6 x 4
##       x      y     z     m
##   <dbl>  <dbl> <dbl> <dbl>
## 1 0.266 0.945  0.687 0.553
## 2 0.372 0.661  0.384 0.553
## 3 0.573 0.629  0.770 0.553
## 4 0.908 0.0618 0.498 0.553
## 5 0.202 0.206  0.718 0.553
## 6 0.898 0.177  0.992 0.553
```
]

.pull-right[

```r
df %>%
  rowwise() %>%
  mutate(m = mean(c(x, y, z)))
```

```
## # A tibble: 6 x 4
## # Rowwise: 
##       x      y     z     m
##   <dbl>  <dbl> <dbl> <dbl>
## 1 0.266 0.945  0.687 0.632
## 2 0.372 0.661  0.384 0.472
## 3 0.573 0.629  0.770 0.657
## 4 0.908 0.0618 0.498 0.489
## 5 0.202 0.206  0.718 0.375
## 6 0.898 0.177  0.992 0.689
```
]

.footnote[A vectorized version: `df %>% mutate(m = rowMeans(select(., x, y, z)))`]

]
---
# `filter()`

`filter()` is used to subset rows from a dataframe
--

Similar to `[x, ]` except that it drops NAs
--

```r
filter(.data, ..., .preserve = FALSE)
```
--

- `.data` is the data to subset on
--

- `...` are the condition(s) that specify the subset
--

- `.preserve` controls the grouping of the returned dataframe<sup>1</sup>
--

.footnote[[1] If `.data` is grouped and `filter()` reduces the number of groups available in the data, the grouping will be recalculated (i.e., number of groups reduced) based on the new data when `.preserve` is set to `FALSE`]
---
## `filter()` Example

.small[

```r
starwars %>%
  filter(mass > mean(mass, na.rm = T))
```

```
## # A tibble: 10 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Darth …    202   136 none        white      yellow          41.9 male  mascu…
##  2 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
##  3 Chewba…    228   112 brown       unknown    blue           200   male  mascu…
##  4 Jabba …    175  1358 <NA>        green-tan… orange         600   herm… mascu…
##  5 Jek To…    180   110 brown       fair       blue            NA   male  mascu…
##  6 IG-88      200   140 none        metal      red             15   none  mascu…
##  7 Bossk      190   113 none        green      red             53   male  mascu…
##  8 Dexter…    198   102 none        brown      yellow          NA   male  mascu…
##  9 Grievo…    216   159 none        brown, wh… green, y…       NA   male  mascu…
## 10 Tarfful    234   136 brown       brown      blue            NA   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```

Notice that you don't need to refer to the `mass` column as `starwars$mass` because `filter()` already knows the context from the `.data` argument.

👇 same as

```r
starwars[starwars$mass > mean(starwars$mass, na.rm = T), ]
```
]

---
## `filter()` Multiple Conditions

.smallish[
`filter()` can handle multiple conditions upon which to subset your data

When you pass `filter()` multiple conditions, they are combined with the `&` operator

```r
starwars %>%
  filter(mass > mean(mass, na.rm = T),
         eye_color %in% c("blue", "red"))
```

```
## # A tibble: 6 x 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Owen La…    178   120 brown, grey light      blue              52 male  mascu…
## 2 Chewbac…    228   112 brown       unknown    blue             200 male  mascu…
## 3 Jek Ton…    180   110 brown       fair       blue              NA male  mascu…
## 4 IG-88       200   140 none        metal      red               15 none  mascu…
## 5 Bossk       190   113 none        green      red               53 male  mascu…
## 6 Tarfful     234   136 brown       brown      blue              NA male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
]
---
To `filter()` using the or (`|`) operator, include it on one line as you would inside `[` `]` or `subset()`

```r
starwars %>%
  filter(mass > mean(mass, na.rm = T) | eye_color %in% c("blue", "red"))
```

```
## # A tibble: 28 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S…    172    77 blond       fair       blue            19   male  mascu…
##  2 R2-D2       96    32 <NA>        white, bl… red             33   none  mascu…
##  3 Darth …    202   136 none        white      yellow          41.9 male  mascu…
##  4 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
##  5 Beru W…    165    75 brown       light      blue            47   fema… femin…
##  6 R5-D4       97    32 <NA>        white, red red             NA   none  mascu…
##  7 Anakin…    188    84 blond       fair       blue            41.9 male  mascu…
##  8 Wilhuf…    180    NA auburn, gr… fair       blue            64   male  mascu…
##  9 Chewba…    228   112 brown       unknown    blue           200   male  mascu…
## 10 Jabba …    175  1358 <NA>        green-tan… orange         600   herm… mascu…
## # … with 18 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
```

---
# Filter by Group

`filter()` will operate on grouped data:
--

👇 filter the entire dataset to characters whose mass is greater than average

```r
starwars %>%
  filter(mass > mean(mass, na.rm = T)) %>%
  dim()
```

```
## [1] 10 14
```
--

👇 filter the entire dataset to characters whose mass is greater than the average *within their eye color*

```r
starwars %>%
  group_by(eye_color) %>%
  filter(mass > mean(mass, na.rm = T)) %>%
  dim()
```

```
## [1] 27 14
```

---
# `slice()`

`slice()` allows you to subset rows using their integer locations

Similar to passing `[x, ]` a numeric vector
--

- `slice()` (positive integers keep rows, negative integers remove rows)
--

- `slice_head()`: keep the top `n` or `prop` (proportion) rows
--

- `slice_tail()`: keep the bottom `n` or `prop` (proportion) rows 
--

- `slice_min(order_by = col)`: keeps rows where `col` is at its minimum (`col` is a column in the data)
--

- `slice_max(order_by = col)`: keeps rows where `col` is at its maximum (`col` is a column in the data)
--

- `slice_sample()`: randomly selects `n` or `prop` rows
---
# `slice()` example

.small[
To keep the 2nd, 54th, and 83rd row in the data:

```r
starwars %>%
  slice(2, 54, 83)
```

```
## # A tibble: 3 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex   gender 
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  
## 1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascul…
## 2 Yarael …    264    NA none       white      yellow            NA male  mascul…
## 3 Rey          NA    NA brown      light      hazel             NA fema… femini…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```

To remove rows 1, 3-53, 55-82, and 84-87:

```r
starwars %>%
  slice(-1, -(3:53), -(55:82), -(84:87))
```

---
# `slice()`: A Warning

The starwars data has 87 rows:

```r
nrow(starwars)
```

```
## [1] 87
```

**Warning**: If you try to slice off rows that don't exist, R will not throw an error! It just doesn't subset those rows...

```r
starwars %>%
  slice(5, 100)
```

```
## # A tibble: 1 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Leia Or…    150    49 brown      light      brown             19 female femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
---
# `slice()` by Group

If the `.data` are grouped, `slice()` will operate on each group of the data:
.smallish[

```r
starwars %>%
  filter(eye_color %in% c("black", "blue", "brown")) %>%
  group_by(eye_color) %>%
  slice(3, 10)
```

```
## # A tibble: 6 x 14
## # Groups:   eye_color [3]
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Gasgano     122    NA none       white, blue black             NA male  mascu…
## 2 BB8          NA    NA none       none        black             NA none  mascu…
## 3 Beru Wh…    165    75 brown      light       blue              47 fema… femin…
## 4 Qui-Gon…    193    89 brown      fair        blue              92 male  mascu…
## 5 Han Solo    180    80 brown      fair        brown             29 male  mascu…
## 6 Shmi Sk…    163    NA black      fair        brown             72 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
]
---
.smallish[

## `select()`

`select()` is used to subset columns from a dataframe (similar to `[, x]`, but variable names don't need to be quoted or passed as a vector)

```r
starwars %>%
  select(character, mass, skin_color)
```

- `:` is used to select a range of consecutive columns

```r
starwars %>%
  select(birth_year:species)
```

- `!` is used to negate a selection of columns

```r
starwars %>%
  select(!birth_year, !vehicles, !starships)
```

- `-` is also used to negate a selection of columns

```r
starwars %>%
  select(-birth_year, -vehicles, -starships)
```

]

---
### `select()` helpers

.small[
You can dynamically select columns using a variety of helper functions from [tidyselect](https://tidyselect.r-lib.org/reference/language.html):

- `where()` allows you to select columns based on a function that returns a logical value

```r
starwars %>%
  select(where(is.character)) %>%
  colnames()
```

```
## [1] "name"       "hair_color" "skin_color" "eye_color"  "sex"       
## [6] "gender"     "homeworld"  "species"
```

- `starts_with()` selects columns that start with a specified prefix

```r
starwars %>%
  select(starts_with("h")) %>%
  colnames()
```

```
## [1] "height"     "hair_color" "homeworld"
```

- `ends_with()` selects all columns that end with a specified suffix

```r
starwars %>%
  select(ends_with("color")) %>%
  colnames()
```

```
## [1] "hair_color" "skin_color" "eye_color"
```
]
---
.smallish[
- `contains()` selects all columns that contain a string

```r
starwars %>%
  select(contains("me")) %>%
  colnames()
```

```
## [1] "name"      "homeworld"
```

- `num_range()` matches a numerical range (e.g., `phq_1`, `phq_2`, `phq_3`, ...)

```r
psych::bfi %>%
  select(num_range("A", 1:5)) %>%
  colnames()
```

```
## [1] "A1" "A2" "A3" "A4" "A5"
```

- `matches()` matches a regular expression

```r
starwars %>%
  select(matches("^[ns]|s$")) %>%
  colnames()
```

```
## [1] "name"       "mass"       "skin_color" "sex"        "species"   
## [6] "films"      "vehicles"   "starships"
```
]
---
.small[
- `all_of()` matches column names from a character vector

```r
hcols <- c("height", "hair_color", "homeworld")

starwars %>%
  select(all_of(hcols)) %>%
  colnames()
```

```
## [1] "height"     "hair_color" "homeworld"
```

- `any_of()` works the same way as `all_of()` except that no error is thrown for non-existent columns:

```r
hcols <- c("height", "hair_color", "homeworld", "Not_a_Column")
```

.pull-left[

```r
starwars %>%
* select(any_of(hcols)) %>%
  colnames()
```

```
## [1] "height"     "hair_color" "homeworld"
```
]

.pull-right[

```r
starwars %>%
* select(all_of(hcols)) %>%
  colnames()
```

```
## Error: Can't subset columns that don't exist.
## x Column `Not_a_Column` doesn't exist.
```
]

- `everything()` matches all columns

```r
starwars %>%
  select(everything()) %>%
  dim()
```

```
## [1] 87 14
```

]
---
# Column Renaming

You can rename variables while selecting them with `select()`:<sup>1</sup>

```r
starwars %>%
  select(character = name, weight = mass, height) %>%
  colnames()
```

```
## [1] "character" "weight"    "height"
```
.footnote[[1] Notice the syntax: `select(new = old)`]
--

However it is more explicitly done with `rename()`:

```r
starwars %>%
* rename(character = name, weight = mass) %>%
  select(character, weight, height) %>%
  colnames()
```

```
## [1] "character" "weight"    "height"
```

---
## Creating New Columns: `mutate()`
.small[
`mutate()` allows you to create a new column of data or modify existing columns

```r
starwars %>% 
* mutate(bmi = mass / (height/100)^2) %>% # convert cm to m
  select(mass, height, bmi) %>%
  head(3)
```

```
## # A tibble: 3 x 3
##    mass height   bmi
##   <dbl>  <int> <dbl>
## 1    77    172  26.0
## 2    75    167  26.9
## 3    32     96  34.7
```

We can also convert height to meters first, then calculate BMI:

.pull-left[

```r
starwars %>%
* mutate(height = height / 100,
*        bmi = mass / height^2) %>%
  select(mass, height, bmi) %>%
  slice(1, 3)
```

```
## # A tibble: 2 x 3
##    mass height   bmi
##   <dbl>  <dbl> <dbl>
## 1    77   1.72  26.0
## 2    32   0.96  34.7
```
]

.pull-right[
👈  multiple statements can be placed in the same `mutate()` call, and just like in `tibble()`, they build on each other dynamically
]

]
---
## `mutate(across())`
.small[
The helper functions discussed for `select()` can also be used with `mutate()` to modify/create several columns simultaneously, using the `across()` function

`across()` takes column specifiers (e.g., using the helper functions) and a function to apply to the relevant columns. This function needs to be an anonymous function using either:

- `function(args) instructions`
- `\(args) instructions` (base R lambda function)
- `~ instructions` (`purrr`-style lambda function, where the current column is referenced as `.x`)

Three ways to square values across all numeric columns:

```r
# Using function(args) instructions
starwars %>%
  mutate(across(where(is.numeric),
*               function(x) x^2))
```

```r
# Using \(args) instructions (R 4.1+ only)
starwars %>%
  mutate(across(where(is.numeric),
*               \(x) x^2))
```

```r
# Using dplyr ~ syntax (purrr-style lambda function)
starwars %>%
  mutate(across(where(is.numeric),
*               ~ .x^2))
```
]
---
## `mutate()` Example

.smallish[
Let's say we want to compute a sum score for all the `vehicles` and `starships` a character has piloted (much like creating a sum score across several items from a test). The `vehicles` and `starships` columns are both list columns where each element (i.e., cell) is a character vector of all the vehicles/starships the character has piloted

First we want to determine the length of each vector for both variables, which corresponds with the number of vehicles/starships the character has piloted:

```r
starwars <- starwars %>%
  rowwise() %>%
  mutate(across(vehicles:starships,
                ~ length(.x),
                .names = "{.col}_n")) %>%
  ungroup() # Remove rowwise() grouping structure

starwars %>% select(matches("^vehicles|^starships")) %>% head(4)
```

```
## # A tibble: 4 x 4
##   vehicles  starships vehicles_n starships_n
##   <list>    <list>         <int>       <int>
## 1 <chr [2]> <chr [2]>          2           2
## 2 <chr [0]> <chr [0]>          0           0
## 3 <chr [0]> <chr [0]>          0           0
## 4 <chr [0]> <chr [1]>          0           1
```
]
---
.smallish[
Then we want to take the sum of `vehicles_n` and `starships_n` and store it in a new variable called `total_piloted`:

```r
starwars <- starwars %>%
  rowwise() %>%
  mutate(total_piloted = sum(vehicles_n, starships_n, na.rm = T)) %>%
  ungroup()

starwars %>% select(vehicles_n, starships_n, total_piloted) %>% head(4)
```

```
## # A tibble: 4 x 3
##   vehicles_n starships_n total_piloted
##        <int>       <int>         <int>
## 1          2           2             4
## 2          0           0             0
## 3          0           0             0
## 4          0           1             1
```

This can also be done without using `rowwise()` (it is very slow with large dataframes) using the vectorized `rowSums()` function:

```r
starwars <- starwars %>% 
  mutate(total_piloted = rowSums(select(., vehicles_n, starships_n),
                                 na.rm = T))
```
☝️ `rowSums()` takes a dataframe as its first argument, so you need to use `select()` on `.`
]
---
## `if_else()`

.smallish[
Remember `if_else()`? We can use it inside mutate:

```r
starwars <- starwars %>%
  mutate(height_ordinal = if_else(height > (mean(height, na.rm = T) + sd(height, na.rm = T)), "tall",
                                  if_else(height < (mean(height, na.rm = T) - sd(height, na.rm = T)), "short",
                                          "average")))

starwars %>% select(starts_with("height")) %>% slice_sample(n = 10)
```

```
## # A tibble: 10 x 2
##    height height_ordinal
##     <int> <chr>         
##  1     NA <NA>          
##  2    188 average       
##  3    213 tall          
##  4    178 average       
##  5    191 average       
##  6     97 short         
##  7    170 average       
##  8    178 average       
##  9    163 average       
## 10    196 average
```
]

---
## `case_when()`

.smallish[
`case_when()` allows you to vectorize multiple `if_else()` statements:

```r
starwars <- starwars %>%
  mutate(height_ordinal = case_when(height > mean(height, na.rm = T) + sd(height, na.rm = T) ~ "tall",
                                    height < mean(height, na.rm = T) - sd(height, na.rm = T) ~ "short",
                                    is.na(height) ~ NA_character_, # need to handle NAs explicitly
                                    TRUE ~ "average")) # if none of the above are matched, this one will be default

starwars %>% select(starts_with("height")) %>% slice_sample(n = 10)
```

---
## Example Data

Consider the following example data from a fake EMA study (2 participants, 2 days, 3x/day):

.smallish[
.pull-left[

```
## # A tibble: 12 x 5
##       id   day  ping      x       y
##    <dbl> <dbl> <int>  <dbl>   <dbl>
##  1  1001     1     1  1.35  -0.163 
##  2  1001     1     2  9.21   1.73  
##  3  1001     1     3  1.99   0.0638
##  4  1001     2     1 -0.759 -1.21  
##  5  1001     2     2  3.86   1.98  
##  6  1001     2     3  0.863 -0.129 
##  7  1002     1     1  4.89   2.35  
##  8  1002     1     2  3.53  -1.12  
##  9  1002     1     3  0.539  1.64  
## 10  1002     2     1  1.10   1.21  
## 11  1002     2     2  9.48   5.13  
## 12  1002     2     3  6.30   2.44
```
]

.pull-right[

```r
d <- tibble(id = rep(c(1001, 1002), each = 6),
            day = rep(c(1, 1, 1, 2, 2, 2), 2),
            ping = rep(1:3, 4),
            x = rnorm(12, 5, 4),
            y = x*0.3 + rnorm(12))
```

- `id` = participant ID

- `day` = day in study

- `ping` = prompt within each day

- `x` = predictor variable

- `y` = outcome variable

]
]
---
## `lag()` and `lead()`

`lag()` allows you to create lagged variables for analysis (`lead()` does the opposite):

.small[

```r
d <- d %>%
  group_by(id) %>%
  mutate(x_l1 = lag(x), # lag(1)
         x_l2 = lag(x, 2)) # lag(2)

print(d)
```

```
## # A tibble: 12 x 7
## # Groups:   id [2]
##       id   day  ping      x       y   x_l1   x_l2
##    <dbl> <dbl> <int>  <dbl>   <dbl>  <dbl>  <dbl>
##  1  1001     1     1  1.35  -0.163  NA     NA    
##  2  1001     1     2  9.21   1.73    1.35  NA    
##  3  1001     1     3  1.99   0.0638  9.21   1.35 
##  4  1001     2     1 -0.759 -1.21    1.99   9.21 
##  5  1001     2     2  3.86   1.98   -0.759  1.99 
##  6  1001     2     3  0.863 -0.129   3.86  -0.759
##  7  1002     1     1  4.89   2.35   NA     NA    
##  8  1002     1     2  3.53  -1.12    4.89  NA    
##  9  1002     1     3  0.539  1.64    3.53   4.89 
## 10  1002     2     1  1.10   1.21    0.539  3.53 
## 11  1002     2     2  9.48   5.13    1.10   0.539
## 12  1002     2     3  6.30   2.44    9.48   1.10
```
]
---
## Targeting the `nth()` Elements

.small[
It is often useful to target the n<sup>th</sup> element of a vector when manipulating your data. For this, we can use the `first()`, `last()`, and `nth()` functions. For example, let's say we want to predict `y` with lag(1) `x`, but we don't want to use the last observation of the day to predict the first of the next day. We can make all relevant values NA for the analysis:

```r
d %>%
  group_by(id, day) %>%
  arrange(id, day, ping) %>% # Necessary to make sure last() works
  mutate(x_l1_NAlast = case_when(ping == last(ping) ~ NA_real_,
                       TRUE ~ x_l1))
```

```
## # A tibble: 12 x 8
## # Groups:   id, day [4]
##       id   day  ping      x       y   x_l1   x_l2 x_l1_NAlast
##    <dbl> <dbl> <int>  <dbl>   <dbl>  <dbl>  <dbl>       <dbl>
##  1  1001     1     1  1.35  -0.163  NA     NA          NA    
##  2  1001     1     2  9.21   1.73    1.35  NA           1.35 
##  3  1001     1     3  1.99   0.0638  9.21   1.35       NA    
##  4  1001     2     1 -0.759 -1.21    1.99   9.21        1.99 
##  5  1001     2     2  3.86   1.98   -0.759  1.99       -0.759
##  6  1001     2     3  0.863 -0.129   3.86  -0.759      NA    
##  7  1002     1     1  4.89   2.35   NA     NA          NA    
##  8  1002     1     2  3.53  -1.12    4.89  NA           4.89 
##  9  1002     1     3  0.539  1.64    3.53   4.89       NA    
## 10  1002     2     1  1.10   1.21    0.539  3.53        0.539
## 11  1002     2     2  9.48   5.13    1.10   0.539       1.10 
## 12  1002     2     3  6.30   2.44    9.48   1.10       NA
```
]

---
# `arrange()`

.small[

`arrange()` orders the rows of a dataframe by values within the specified columns

Values are arranged in ascending order my default. To arrange by a column in descending order, use the `desc()` function:

```r
# Arrange ascending by id, then day, then descending by ping
d %>%
  arrange(id, day, desc(ping))
```

```
## # A tibble: 12 x 7
## # Groups:   id [2]
##       id   day  ping      x       y   x_l1   x_l2
##    <dbl> <dbl> <int>  <dbl>   <dbl>  <dbl>  <dbl>
##  1  1001     1     3  1.99   0.0638  9.21   1.35 
##  2  1001     1     2  9.21   1.73    1.35  NA    
##  3  1001     1     1  1.35  -0.163  NA     NA    
##  4  1001     2     3  0.863 -0.129   3.86  -0.759
##  5  1001     2     2  3.86   1.98   -0.759  1.99 
##  6  1001     2     1 -0.759 -1.21    1.99   9.21 
##  7  1002     1     3  0.539  1.64    3.53   4.89 
##  8  1002     1     2  3.53  -1.12    4.89  NA    
##  9  1002     1     1  4.89   2.35   NA     NA    
## 10  1002     2     3  6.30   2.44    9.48   1.10 
## 11  1002     2     2  9.48   5.13    1.10   0.539
## 12  1002     2     1  1.10   1.21    0.539  3.53
```
]
---
# `coalesce()`

.small[
Sometimes you have two or more mutually exclusive variables that belong in the same column for analysis. `coalesce()` helps you combines these variables by finding the first non-`NA` value.

For example, suppose you have 3 columns representing how much participants like their Windows, Mac, or Linux computer (depending on which operating system they use), but you only care about their computer rating:
.pull-left[

```r
d_comp <- tibble(id = 1001:1003,
                 Windows = c(NA, NA, 3),
                 Mac = c(7, NA, NA),
                 Linux = c(NA, 10, NA))

print(d_comp)
```
]

.pull-right[

```
## # A tibble: 3 x 4
##      id Windows   Mac Linux
##   <int>   <dbl> <dbl> <dbl>
## 1  1001      NA     7    NA
## 2  1002      NA    NA    10
## 3  1003       3    NA    NA
```
]

Using `coalesce()` we can combine these into one variable for analysis:

```r
d_comp <- d_comp %>%
  mutate(rating = coalesce(Windows, Mac, Linux)) %>%
  select(id, rating)
```

<pre style="font-size:13.5px;">
## # A tibble: 3 x 5
##      id  rating
##   &lt;int&gt;   &lt;dbl&gt;
## 1  1001       7
## 2  1002      10
## 3  1003       3
</pre>

]
---
## `na_if()`

.small[
`na_if()` allows you to replace specified values with `NA`

```r
starwars %>%
* mutate(eye_color_cleaned = na_if(eye_color, "unknown")) %>%
  select(name, eye_color, eye_color_cleaned) %>%
  tail(5)
```

```
## # A tibble: 5 x 3
##   name           eye_color eye_color_cleaned
##   <chr>          <chr>     <chr>            
## 1 Rey            hazel     hazel            
## 2 Poe Dameron    brown     brown            
## 3 BB8            black     black            
## 4 Captain Phasma unknown   <NA>             
## 5 Padmé Amidala  brown     brown
```

This does the same exact thing as:

```r
dataframe$column[dataframe$column == val] <- NA 
```

```r
dataframe %>%
  mutate(column = if_else(column == value, NA, column))
```

```r
dataframe %>%
  mutate(column = case_when(column == value ~ NA,
                            TRUE ~ column)
```

]
---
## `relocate()`

.smallish[
`relocate()` allows you to rearrange columns in a dataframe

```r
relocate(.data, ..., .before = NULL, .after = NULL)
```

- `.data` is the dataframe to reorder

- `...` are the columns to move

- `.before` destination to move columns before (colname, index, `tidy-select`)

- `.after` destination to move columns after (colname, index, `tidy-select`)

For example, to move all the numeric columns before the character columns:

```r
starwars %>%
  relocate(where(is.numeric), .before = where(is.character)) %>%
  sapply(class)
```
.small[

```
##         height           mass     birth_year     vehicles_n    starships_n 
##      "integer"      "numeric"      "numeric"      "integer"      "integer" 
##  total_piloted           name     hair_color     skin_color      eye_color 
##      "integer"    "character"    "character"    "character"    "character" 
##            sex         gender      homeworld        species          films 
##    "character"    "character"    "character"    "character"         "list" 
##       vehicles      starships height_ordinal 
##         "list"         "list"    "character"
```
]
]
---
## `distinct()`

.small[
`distinct()` selects the unique rows from a data.frame (similar to the `unique.data.frame()` function in base R, but it is faster when you are working with a large dataframe). For example:

```r
df <- tibble(a = c(1, 1, 2, 2),
             b = c(1, 1, 2, 1),
             c = c(3, 3, 2, 3))
```

```
## # A tibble: 4 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     1     3
## 2     1     1     3
## 3     2     2     2
## 4     2     1     3
```

.pull-left[

```r
# Find distinct rows based on cols a and b
df %>%
  distinct(a, b)
```

```
## # A tibble: 3 x 2
##       a     b
##   <dbl> <dbl>
## 1     1     1
## 2     2     2
## 3     2     1
```
]

.pull-right[

```r
# To keep all columns
df %>%
  distinct(a, b, .keep_all = TRUE)
```

```
## # A tibble: 3 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     1     3
## 2     2     2     2
## 3     2     1     3
```
]
]
---
## `pull()`

.smallish[
`pull()` allows you to extract a column from a dataframe as a vector and is equivalent to `$`

```r
d$ping
```

```
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
```

```r
d %>% pull(ping)
```

```
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
```

`pull()` is useful when you want a column after manipulating a dataframe. For example, find the mean height of characters whose mass is greater than average and who are not orifinally from the planet Naboo:

```r
starwars %>%
  filter(mass > mean(mass, na.rm = T),
         homeworld != "Naboo") %>%
  pull(height) %>%
  mean()
```

```
## [1] 200.1111
```
]
---
class: inverse
# Summarizing Data
<div style="text-align: center;">
  <img src="images/dplyr.svg", alt="Tibble Logo" style="max-width: 70%;">
</div>
---
# `summarize()`

.small[
`summarize()` returns a dataframe with specified summary statistic(s) of your data with 1+ rows for each combination of grouping variables (1 for no grouping structure) and 1 column for each summary statistic (much like `tapply()`, but much more flexible with cleaner output)

```r
mtcars %>%
  summarize(nobs = n(),
            mpg_mean = mean(mpg, na.rm = T),
            mpg_sd = sd(mpg, na.rm = T))
```

```
##   nobs mpg_mean   mpg_sd
## 1   32 20.09062 6.026948
```

When we group the data before calling `summarize()` we will get summary statistics for each group:

```r
mtcars %>%
* group_by(cyl) %>%
  summarize(nobs = n(),
            mpg_mean = mean(mpg, na.rm = T),
            mpg_sd = sd(mpg, na.rm = T))
```

```
## # A tibble: 3 x 4
##     cyl  nobs mpg_mean mpg_sd
##   <dbl> <int>    <dbl>  <dbl>
## 1     4    11     26.7   4.51
## 2     6     7     19.7   1.45
## 3     8    14     15.1   2.56
```
]
---

## `summarize()` whatever you want

.smallish[

You can specify any function you want within `summarize()`

```r
mtcars %>%
  group_by(cyl) %>%
  summarize(nobs = n(), # n observations for each group
            percentN = (n() / nrow(.)) * 100, # percent of total observations
            from_total_obs = nrow(.) - n(), # distance from total observations
            mpg_sd_sample = sd(mpg, na.rm = T), # sample SD (n-1 in denominator)
            mpg_sd_pop = sqrt(sum((mpg - mean(mpg, na.rm = T))^2) / (n()))) # population SD (n in denominator)
```

```
## # A tibble: 3 x 6
##     cyl  nobs percentN from_total_obs mpg_sd_sample mpg_sd_pop
##   <dbl> <int>    <dbl>          <int>         <dbl>      <dbl>
## 1     4    11     34.4             21          4.51       4.30
## 2     6     7     21.9             25          1.45       1.35
## 3     8    14     43.8             18          2.56       2.47
```
]

---
## `summarize(across())`

.smallish[
Much like `mutate(across())` allowed us to manipulate several columns at once, `summarize(across())` allows us to programmatically target columns for summarizing

For example, to take the mean across all numeric variables:

```r
starwars %>%
  group_by(sex) %>%
  summarize(across(where(is.numeric),
                   ~ mean(.x, na.rm = T),
                   .names = "mean_{.col}"))
```
]

.small[

```
## # A tibble: 5 x 7
##   sex     mean_height mean_mass mean_birth_year mean_vehicles_n mean_starships_n
##   <chr>         <dbl>     <dbl>           <dbl>           <dbl>            <dbl>
## 1 female         169.      54.7            47.2           0.125            0.188
## 2 hermap…        175     1358             600             0                0    
## 3 male           179.      81.0            85.5           0.183            0.45 
## 4 none           131.      69.8            53.3           0                0    
## 5 <NA>           181.      48              62             0                0.25 
## # … with 1 more variable: mean_total_piloted <dbl>
```
]
---
### `summarize(across())` with Multiple Summary Functions

.smallish[
To summarize across columns with more than one summary function (e.g., count, mean, and sd), used a named list of summary functions:

```r
starwars %>%
  group_by(sex) %>%
  summarize(across(where(is.numeric),
*                  list(nobs = ~ sum(!is.na(.x)), mean = mean, sd = sd), # list of functions
                   na.rm = T)) # additional arguments passed to the functions
```

```
## # A tibble: 5 x 19
##   sex            height_nobs height_mean height_sd mass_nobs mass_mean mass_sd
##   <chr>                <int>       <dbl>     <dbl>     <int>     <dbl>   <dbl>
## 1 female                  15        169.     15.3          9      54.7    8.59
## 2 hermaphroditic           1        175      NA            1    1358     NA   
## 3 male                    57        179.     36.0         44      81.0   28.2 
## 4 none                     5        131.     49.1          4      69.8   51.0 
## 5 <NA>                     3        181.      2.89         1      48     NA   
## # … with 12 more variables: birth_year_nobs <int>, birth_year_mean <dbl>,
## #   birth_year_sd <dbl>, vehicles_n_nobs <int>, vehicles_n_mean <dbl>,
## #   vehicles_n_sd <dbl>, starships_n_nobs <int>, starships_n_mean <dbl>,
## #   starships_n_sd <dbl>, total_piloted_nobs <int>, total_piloted_mean <dbl>,
## #   total_piloted_sd <dbl>
```
]
---
class: inverse
# Merging Data
<div style="text-align: center;">
  <img src="images/dplyr.svg", alt="Tibble Logo" style="max-width: 70%;">
</div>
---
# Merging Dataframes

Merging dataframes together is an essential part of data management. For example:
--

- Merging baseline data with EMA data
--

- Merging self-report data to physiological data
--

- Merging patient data from multiple clinics
--

- Merging departmental revenue data to employee wage statistics
--

Sometimes marging data is a large part of the entire project, as in the [Washington Merged Longitudinal Administrative Data](https://dcollab.uw.edu/data/wmlad/) project at UW, which takes data from several WA State agencies to answer novel administrative questions.
---
### Questions to Ask Yourself When Merging

When merging datasets `A` and `B`:
--

- Which *rows* do you want to keep from each dataframe?
--

- Which *columns* do you want to keep from each dataframe?
--

- Which variable(s) determine whether rows *match*?
---
# Data Example for Merging

.smallish[
To keep things simple, let's use the following data to practice merging:

.pull-left[

```r
A <- tibble(ID = 0:6,
            x = rnorm(7, c(2, 2, 2, 2, 5, 5, 5, 5)),
            y = x*0.5 + rnorm(7))
```

```
## # A tibble: 7 x 3
##      ID     x     y
##   <int> <dbl> <dbl>
## 1     0  1.37 1.43 
## 2     1  2.18 1.67 
## 3     2  1.16 0.277
## 4     3  3.60 3.31 
## 5     4  5.33 3.05 
## 6     5  4.18 1.47 
## 7     6  5.49 0.529
```
]

.pull-right[

```r
B <- tibble(ID = 1:7,
            group = c(rep(1:2, each = 3), 2),
            age = sample(20:50, 7))  
```

```
## # A tibble: 7 x 3
##      ID group   age
##   <int> <dbl> <int>
## 1     1     1    24
## 2     2     1    21
## 3     3     1    29
## 4     4     2    44
## 5     5     2    31
## 6     6     2    34
## 7     7     2    20
```
]

Notice that `ID == 0` is not in `B` and `ID == 7` is not in `A`

We will use the `ID` column to merge the data (in the `by = ` argument)

]
---
## `left_join()`

.smallish[
`left_join(A, B)` keeps all rows from `A`, all cols from `A` and `B`

.pull-left[
**`dplyr`**

```r
left_join(A, B, by = "ID")
```

```
## # A tibble: 7 x 5
##      ID     x     y group   age
##   <int> <dbl> <dbl> <dbl> <int>
## 1     0  1.37 1.43     NA    NA
## 2     1  2.18 1.67      1    24
## 3     2  1.16 0.277     1    21
## 4     3  3.60 3.31      1    29
## 5     4  5.33 3.05      2    44
## 6     5  4.18 1.47      2    31
## 7     6  5.49 0.529     2    34
```
]

.pull-right[
**Base R Equivalent**

```r
merge(A, B, by = "ID", all.x = T)
```

```
##   ID        x         y group age
## 1  0 1.373546 1.4250978    NA  NA
## 2  1 2.183643 1.6676030     1  24
## 3  2 1.164371 0.2767973     1  21
## 4  3 3.595281 3.3094216     1  29
## 5  4 5.329508 3.0545971     2  44
## 6  5 4.179532 1.4685252     2  31
## 7  6 5.487429 0.5290146     2  34
```
]

☝️ `ID == 7` from `B` not in merged data ☝️

.footnote[Most of your joins will probably be `left_join()`]
]
---
## `right_join()`

.smallish[
`right_join(A, B)` keeps all rows from `B`, all cols from `A` and `B`

.pull-left[
**`dplyr`**

```r
right_join(A, B, by = "ID")
```

```
## # A tibble: 7 x 5
##      ID     x      y group   age
##   <int> <dbl>  <dbl> <dbl> <int>
## 1     1  2.18  1.67      1    24
## 2     2  1.16  0.277     1    21
## 3     3  3.60  3.31      1    29
## 4     4  5.33  3.05      2    44
## 5     5  4.18  1.47      2    31
## 6     6  5.49  0.529     2    34
## 7     7 NA    NA         2    20
```
]

.pull-right[
**Base R Equivalent**

```r
merge(A, B, by = "ID", all.y = T)
```

```
##   ID        x         y group age
## 1  1 2.183643 1.6676030     1  24
## 2  2 1.164371 0.2767973     1  21
## 3  3 3.595281 3.3094216     1  29
## 4  4 5.329508 3.0545971     2  44
## 5  5 4.179532 1.4685252     2  31
## 6  6 5.487429 0.5290146     2  34
## 7  7       NA        NA     2  20
```
]

☝️ `ID == 0` from `A` not in merged data ☝️

]
---
## `inner_join()`

.smallish[
`inner_join(A, B)` keeps only rows from `A` and `B` that match, all cols from `A` and `B`

.pull-left[
**`dplyr`**

```r
inner_join(A, B, by = "ID")
```

```
## # A tibble: 6 x 5
##      ID     x     y group   age
##   <int> <dbl> <dbl> <dbl> <int>
## 1     1  2.18 1.67      1    24
## 2     2  1.16 0.277     1    21
## 3     3  3.60 3.31      1    29
## 4     4  5.33 3.05      2    44
## 5     5  4.18 1.47      2    31
## 6     6  5.49 0.529     2    34
```
]

.pull-right[
**Base R Equivalent**

```r
merge(A, B, by = "ID", all = F)
```

☝️ Neither `ID == 7` from `A` nor `ID == 0` from `B` in merged data ☝️

.footnote[`all = F` is equivalent to `all.x = F, all.y = F` in base R's `merge()`]
]
---
## `full_join()`

.smallish[
`full_join(A, B)` keeps rows from both `A` and `B`, all cols from `A` and `B`

.pull-left[
**`dplyr`**

```r
full_join(A, B, by = "ID")
```

```
## # A tibble: 8 x 5
##      ID     x      y group   age
##   <int> <dbl>  <dbl> <dbl> <int>
## 1     0  1.37  1.43     NA    NA
## 2     1  2.18  1.67      1    24
## 3     2  1.16  0.277     1    21
## 4     3  3.60  3.31      1    29
## 5     4  5.33  3.05      2    44
## 6     5  4.18  1.47      2    31
## 7     6  5.49  0.529     2    34
## 8     7 NA    NA         2    20
```
]

.pull-right[
**Base R Equivalent**

```r
merge(A, B, by = "ID", all = T)
```

☝️ `ID == 7` from `A` and `ID == 0` from `B` are in merged data ☝️

.footnote[`all = T` is equivalent to `all.x = T, all.y = T` in base R's `merge()`]
]
---
## `semi_join()`

.smallish[
`semi_join(A, B)` keeps only rows from both `A` and `B` that match, only cols from `A`

.pull-left[
**`dplyr`**

```r
semi_join(A, B, by = "ID")
```

```
## # A tibble: 6 x 3
##      ID     x     y
##   <int> <dbl> <dbl>
## 1     1  2.18 1.67 
## 2     2  1.16 0.277
## 3     3  3.60 3.31 
## 4     4  5.33 3.05 
## 5     5  4.18 1.47 
## 6     6  5.49 0.529
```
]

.pull-right[
**Base R Equivalent**

```r
merge(A, B[, colnames(B) %in% colnames(A)],
      by = "ID", all = F)
```

```
##   ID        x         y
## 1  1 2.183643 1.6676030
## 2  2 1.164371 0.2767973
## 3  3 3.595281 3.3094216
## 4  4 5.329508 3.0545971
## 5  5 4.179532 1.4685252
## 6  6 5.487429 0.5290146
```
]
]
---
## `anti_join()`

.smallish[
`anti_join(A, B)` keeps rows from `A` that *do not* match `B`, only cols from `A`

.pull-left[
**`dplyr`**

```r
anti_join(A, B, by = "ID")
```

```
## # A tibble: 1 x 3
##      ID     x     y
##   <int> <dbl> <dbl>
## 1     0  1.37  1.43
```
]

.pull-right[
**Base R Equivalent**

```r
merge(A[!A$ID %in% B$ID, ],
      B[, colnames(B) %in% colnames(A)],
      by = "ID", all.x = T)
```

```
##   ID        x        y
## 1  0 1.373546 1.425098
```
]

.footnote[`all = T` is equivalent to `all.x = T, all.y = T` in base R's `merge()`]
]
---
## Matching With `by = `

- You can pass `by = ` a character vector of columns upon which to match. This is useful when merging nested data (e.g., data from clinic visits nested within patients)

```r
left_join(A, B, by = c("ID", "Date"))
merge(A, B, by = c("ID", "Date"), all.x = T)
```
--

- If the `by` columns used for merging don't have the same name (e.g., "PatientNum" and "PatientID"):

```r
left_join(A, B, by = c("PatientNum" = "PatientID"))
merge(A, B, by.x = "PatientNum", by.y = "PatientID", all.x = T)
```